Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add median and count_distinct aggregation functions #278

Merged
merged 5 commits into from
Sep 5, 2022

Conversation

vibhatha
Copy link
Contributor

@vibhatha vibhatha commented Aug 4, 2022

This PR includes yaml configs for median, approx_median and count_distinct for aggregate functions.

@vibhatha vibhatha marked this pull request as ready for review August 4, 2022 05:53
@vibhatha vibhatha force-pushed the aggregate_meadian_count branch from e641541 to 4a4921c Compare August 15, 2022 12:01
@vibhatha
Copy link
Contributor Author

@jvanstraten would it be better introduce an option like precision or and introduce EXACT and APPROXIMATE instead of writing two separate functions. I did this today for another PR. Not sure if it is a good way to handle it. WDYT?

Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot to update this.

would it be better introduce an option like precision or and introduce EXACT and APPROXIMATE instead of writing two separate functions

I don't really have an opinion, but unless there's a statistical definition I'm not aware of, I do believe we need to specify what "approximate" means. Is 100.0 an acceptable approximate median of [101.2, 101.3, 101.4]? Probably not, so where's the limit?

@@ -742,6 +742,105 @@ aggregate_functions:
- value: fp64
nullability: DECLARED_OUTPUT
return: fp64?
- name: "median"
description: Calculate the median for a set of values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this should include something like "Returns null if applied to zero records. For the integer implementations, the rounding option determines how the median should be rounded if it ends up midway between two values. For the floating point implementations, they specify the usual floating point rounding mode."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the description?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@jvanstraten jvanstraten changed the title feat: adding median, approx_median and count_distinct for aggregation functions feat: add median, approx_median and count_distinct for aggregation functions Sep 2, 2022
@jvanstraten
Copy link
Contributor

jvanstraten commented Sep 2, 2022

@vibhatha Please extend the descriptions. Here's a good example of the level of descriptiveness I'm looking for:

name: replace_slice
description: >-
Replace a slice of the input string. A specified 'length' of characters will be deleted from
the input string beginning at the 'start' position and will be replaced by a new string. A
start value of 1 indicates the first character of the input string. If start is negative
or zero, or greater than the length of the input string, a null string is returned. If 'length'
is negative, a null string is returned. If 'length' is zero, inserting of the new string
occurs at the specified 'start' position and no characters are deleted. If 'length' is
greater than the input string, deletion will occur up to the last character of the input string.
impls:
- args:
- value: "string"
name: "input"
description: Input string.
- value: i64
name: "start"
description: The position in the string to start deleting/inserting characters.
- value: i64
name: "length"
description: The number of characters to delete from the input string.
- value: "string"
name: "replacement"
description: The new string to insert at the start position.
return: "string"
- args:
- value: "varchar<L1>"
name: "input"
description: Input string.
- value: i64
name: "start"
description: The position in the string to start deleting/inserting characters.
- value: i64
name: "length"
description: The number of characters to delete from the input string.
- value: "varchar<L2>"
name: "replacement"
description: The new string to insert at the start position.
return: "varchar<L1>"
Better to be verbose and explicit than to assume the reader already knows what the function does, especially in odd corner cases, and in this case for what "approximate" actually means in terms of accuracy.

Also, IIRC we went for an option for approximate vs precise exact for other statistical functions? Not sure if they're merged yet but I remember that being the conclusion.

@vibhatha
Copy link
Contributor Author

vibhatha commented Sep 5, 2022

@jvanstraten we had a function for population and sample and I did write a option recently for precision. I am going to add it here as well. Let me push what I wrote, then we can probably re-word it and standardize to re-use.

I am also wondering can we reference such options under a general options attribute. Rather than re-writing the description and definitions over and over again. It could save spacing and readability. I am not sure if this is possible, but just a suggestion. Not asking about array of items, just

operator_options:
     - population
     - precision
     - ....

And the operator_options can be read from somewhere else. How about that?

@vibhatha vibhatha force-pushed the aggregate_meadian_count branch from 4c30d4a to d2dc8de Compare September 5, 2022 01:15
@vibhatha vibhatha requested review from jvanstraten and removed request for cpcloud September 5, 2022 01:15
Copy link
Contributor

@jvanstraten jvanstraten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also wondering can we reference such options under a general options attribute. Rather than re-writing the description and definitions over and over again. It could save spacing and readability.

We've run into similar things with the YAMLs before, and the conclusion has always been that it's easier on consumers of the YAML files to just repeat everything. If at some point it becomes really annoying we could always just generate the YAML files. However, the descriptions of the standard options we use all over the core extensions (like overflow, rounding, and I guess precision and population) could IMO just be defined on the website instead of in the YAMLs. A computer that's parsing these YAML files won't care about the description anyway. There's no page for this yet, but I'm planning to rewrite/add to the extension pages of the website soon, and there will absolutely be a section for this. So I'm fine with just omitting the descriptions of precision for now.

on saving memory bandwidth, the precision of the end result can be
the highest possible accuracy of an approximation.

- EXACT: provides the highest accurate output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- EXACT: provides the highest accurate output
- EXACT: provides the exact result, rounded if needed according
to the rounding option

the highest possible accuracy of an approximation.

- EXACT: provides the highest accurate output
- APPROXIMATE: provides a sub-optimal output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still doesn't specify how approximate the result may be. Is this what you mean, or is this too broad?

Suggested change
- APPROXIMATE: provides a sub-optimal output
- APPROXIMATE: provides only an estimate; the result must lie
between the minimum and maximum values in the input
(inclusive), but otherwise the accuracy is left up to the
consumer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to mean this one, but may be we could elaborate this a bit?

Although, this is broad since the optimization strategy is very hard to specify. Again I think the choice of approximation is up to the engine to decide and that depends on various optimization techniques. But that's not the job of Substrait to define that optimization strategy. Substrait can only specify that it is going to be an approximation. So I wanted to put forward that idea.

Should we enhance more?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that we can do better than what I suggested to constrain the approximation method further, but you're welcome to try. I at least can't think of any approximate median algorithm that doesn't at least satisfy that constraint, and only trivial constraints is better than no constraints, IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the suggested once for now. I will update this.

@vibhatha
Copy link
Contributor Author

vibhatha commented Sep 5, 2022

I am also wondering can we reference such options under a general options attribute. Rather than re-writing the description and definitions over and over again. It could save spacing and readability.

We've run into similar things with the YAMLs before, and the conclusion has always been that it's easier on consumers of the YAML files to just repeat everything. If at some point it becomes really annoying we could always just generate the YAML files. However, the descriptions of the standard options we use all over the core extensions (like overflow, rounding, and I guess precision and population) could IMO just be defined on the website instead of in the YAMLs. A computer that's parsing these YAML files won't care about the description anyway. There's no page for this yet, but I'm planning to rewrite/add to the extension pages of the website soon, and there will absolutely be a section for this. So I'm fine with just omitting the descriptions of precision for now.

I get your point. I think this sounds good for now.

@vibhatha vibhatha force-pushed the aggregate_meadian_count branch from e7e68fc to 7db8984 Compare September 5, 2022 15:26
@vibhatha vibhatha requested review from jvanstraten and cpcloud and removed request for jvanstraten and cpcloud September 5, 2022 15:27
@jvanstraten jvanstraten changed the title feat: add median, approx_median and count_distinct for aggregation functions feat: add median and count_distinct aggregation functions Sep 5, 2022
@jvanstraten jvanstraten merged commit 9be62e5 into substrait-io:main Sep 5, 2022
@jacques-n
Copy link
Contributor

Sorry, just saw this come through. I think we should revert the addition of count distinct here. Distinct isn't generally a property of the function, it is a property of the aggregate. That's why we have distinct as a property of function invocation.

You achieve count distinct by combining count + that property.

@jvanstraten
Copy link
Contributor

Oops, you're right, I didn't think of that. #311

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants